Audio-Video Speaker Diarization for Unsupervised Speaker and Face Model Creation
نویسندگان
چکیده
Our goal is to create speaker models in audio domain and face models in video domain from a set of videos in an unsupervised manner. Such models can be used later for speaker identification in audio domain (answering the question ”Who was speaking and when”) and/or for face recognition (”Who was seen and when”) for given videos that contain speaking persons. The proposed system is based on an audio-video diarization system that tries to resolve the disadvantages of the individual modalities. Experiments on broadcasts of Czech parliament meetings show that the proposed combination of individual audio and video diarization systems yields an improvement of the diarization error rate (DER).
منابع مشابه
Speaker Diarization Using Gaussian Mixture Turns and Segment Matching
Speaker diarization aims to detect “who spoke when” in large audio segments. It is an important task in processing of broadcast news audio, making easier the audio segments selection and indexing task. In this paper an unsupervised speaker diarization scheme is proposed using a Gaussian Mixture Model as a Universal Background Model, Bayesian Information Criterion and fingerprint detection. A de...
متن کاملUsing Weighted Oriented Optical Flow Histograms for Multimodal Speaker Diarization
Speaker diarization currently focuses on using audio features to partition an audio stream into speaker homogeneous speech regions, in other words to determine “who spoke when”. Recent speaker diarization corpora contains video recordings in addition to the commonly used audio. Thus, we investigated the benefits of incorporating video features, namely histograms of weighted oriented optical flo...
متن کاملInteger linear programming for speaker diarization and cross-modal identification in TV broadcast
Most state-of-the-art approaches address speaker diarization as a hierarchical agglomerative clustering problem in the audio domain. In this paper, we propose to revisit one of them: speech turns clustering based on the Bayesian Information Criterion (a.k.a. BIC clustering). First, we show how to model it as an integer linear programming (ILP) problem. Its resolution leads to the same overall d...
متن کاملTowards Audio-Visual On-line Diarization Of Participants In Group Meetings
We propose a fully automated, unsupervised, and non-intrusive method of identifying the current speaker audio-visually in a group conversation. This is achieved without specialized hardware, user interaction, or prior assignment of microphones to participants. Speakers are identified acoustically using a novel on-line speaker diarization approach. The output is then used to find the correspondi...
متن کاملUnsupervised Compensation of Intra-Session Intra-Speaker Variability for Speaker Diarization
This paper presents a novel framework for unsupervised compensation of intra-session intra-speaker variability in the context of speaker diarization. Audio files are parameterized by sequences of GMM-supervectors representing overlapping short segments of speech. Session-dependent intra-session intra-speaker variability is estimated in an unsupervised manner, and is compensated using the nuisan...
متن کامل